The task of response selection in multi-turn dialogue is to find the best option among all candidates. To improve the reasoning ability of the model, previous studies have focused on using explicit algorithms to model the dependencies between utterances, which are deterministic, limited, and inflexible. In addition, few studies consider the differences between the options before and after reasoning. In this paper, we propose an Implicit Relational Reasoning Graph Network to address these issues, which consists of the Utterance Relational Reasoner (URR) and the Option Dual Comparator (ODC). URR implicitly extracts dependencies between utterances, as well as between utterances and options, and performs reasoning with relational graph convolutional networks. ODC focuses on perceiving the differences between the options through dual comparison, which can eliminate the interference of noisy options. Experimental results on two multi-turn dialogue reasoning benchmark datasets, MuTual and MuTual+, show that our method significantly improves on four pretrained language model baselines and achieves state-of-the-art performance. The model surpasses human performance for the first time on the MuTual dataset.
translated by Google Translate
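Existing keyframe-based motion synthesis mainly focuses on the generation of cyclic actions or short-term motion, such as walking, running, and transitions between close postures. However, these methods significantly degrade the naturalness and diversity of the synthesized motion when dealing with complex and impromptu movements, such as dance performances and martial arts. In addition, current research lacks fine-grained control over the generated motion, which is essential for intelligent human-computer interaction and animation creation. In this paper, we propose a novel keyframe-based motion generation network based on multiple constraints, which can achieve diverse dance synthesis via learned knowledge. Specifically, the algorithm is mainly formulated based on recurrent neural network (RNN) and Transformer architectures. The backbone of our network is a hierarchical RNN module composed of two long short-term memory (LSTM) units, in which the first LSTM embeds the posture information of the historical frames into a latent space and the second LSTM predicts the human posture for the next frame. Moreover, our framework contains two Transformer-based controllers, which model the constraints of the root trajectory and the velocity factor respectively, so as to better utilize the temporal context of the frames and achieve fine-grained motion control. We verify the proposed approach on a dance dataset containing a wide range of contemporary dance. The results of three quantitative analyses validate the superiority of our algorithm. The video and qualitative experimental results demonstrate that the complex motion sequences generated by our algorithm can achieve diverse and smooth motion transitions between keyframes, even for long-term synthesis.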
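Numerical reasoning over the financial domain (performing quantitative analysis and summarizing the information in financial reports) can greatly increase business efficiency and reduce costs by billions of dollars. Here, we propose a numerical reasoning question answering system to answer numerical reasoning questions over financial text and table data sources, consisting of a retriever module, a generator module, and an ensemble module. Specifically, in the retriever module, in addition to retrieving whole-row data, we design a cell retriever that retrieves the gold cells, to avoid bringing unrelated and similar cells from the same row into the inputs of the generator module. In the generator module, we use multiple generators to produce programs, which are the operation steps for answering the question. Finally, in the ensemble module, we integrate multiple programs to choose the best program as the output of our system. On the final private test set of the FinQA Competition, our system obtains an execution accuracy of 69.79.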
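To make the notion of a "program" and of choosing among several programs concrete, the following is a minimal sketch only. The step syntax (`op(arg, arg)` with `#i` referring to the result of step `i`), the four operations, and the majority-vote selection rule are all assumptions for illustration; the abstract does not specify the program language or how the ensemble module picks the best program.

```python
import re
from collections import Counter

# Hypothetical operation set; the real system's operations are not given in the abstract.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

STEP = re.compile(r"(\w+)\(([^()]*)\)")  # matches one step such as "divide(#0, 100)"


def execute(program: str) -> float:
    """Run the steps left to right; '#i' refers to the result of step i."""
    results = []
    for op, raw_args in STEP.findall(program):
        args = []
        for token in raw_args.split(","):
            token = token.strip()
            args.append(results[int(token[1:])] if token.startswith("#") else float(token))
        results.append(OPS[op](*args))
    return results[-1]


def select_by_vote(programs):
    """Assumed ensemble rule: keep a program whose result is the most common one."""
    outcomes = [(round(execute(p), 6), p) for p in programs]
    winner, _ = Counter(value for value, _ in outcomes).most_common(1)[0]
    return next(p for value, p in outcomes if value == winner)


candidates = [
    "subtract(120, 100), divide(#0, 100)",  # relative change, two steps
    "divide(20, 100)",                      # same answer, one step
    "add(120, 100)",                        # a disagreeing candidate
]
best = select_by_vote(candidates)
```

Voting on execution results rather than on program text lets differently written programs that reach the same number support each other, which is one plausible reason to ensemble several generators.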
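This paper proposes a simple baseline framework for video-based 2D/3D human pose estimation, named DeciWatch, that achieves a 10-fold efficiency improvement over existing works without any performance degradation. Unlike current solutions that estimate every frame in a video, DeciWatch introduces a simple yet efficient sample-denoise-recover framework that watches only sparsely sampled frames, taking advantage of the continuity of human motion and the lightweight pose representation. Specifically, DeciWatch uniformly samples less than 10% of the video frames for detailed estimation, denoises the estimated 2D/3D poses with an efficient Transformer architecture, and then accurately recovers the remaining frames using another Transformer-based network. Comprehensive experimental results on three video-based human pose estimation and body mesh recovery tasks with four datasets validate the efficiency and effectiveness of DeciWatch. Code is available at https://github.com/cure-lab/deciwatch.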
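The sample-then-recover idea can be illustrated with a toy sketch. Everything here is an assumption for illustration: the function names, the sampling interval, and above all the use of plain linear interpolation in place of the Transformer-based recovery network (the denoising step is omitted entirely). It shows only why motion continuity makes sparse sampling workable, and is not the released implementation.

```python
from bisect import bisect_right


def sample_indices(num_frames, interval):
    """Uniformly pick every `interval`-th frame for detailed pose estimation."""
    return list(range(0, num_frames, interval))


def recover(indices, poses, num_frames):
    """Fill in unobserved frames by linear interpolation between sampled poses.

    Stand-in for the learned recovery network: assumes motion is roughly
    linear between two neighbouring sampled frames.
    """
    full = []
    for t in range(num_frames):
        j = bisect_right(indices, t) - 1      # last sampled frame at or before t
        if j + 1 == len(indices):             # past the last sample: hold its pose
            full.append(list(poses[j]))
            continue
        t0, t1 = indices[j], indices[j + 1]
        w = (t - t0) / (t1 - t0)
        full.append([(1 - w) * a + w * b for a, b in zip(poses[j], poses[j + 1])])
    return full


# Toy example: 21 frames, a 2-number "pose", only frames 0, 10 and 20 are estimated.
idx = sample_indices(21, 10)
sampled = [[0.0, 0.0], [10.0, 20.0], [20.0, 40.0]]
dense = recover(idx, sampled, 21)
```

In this toy setting the pose estimator would run on 3 of 21 frames; a longer clip with a larger interval is what brings the sampled fraction under the 10% quoted above.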
When using LiDAR semantic segmentation models for safety-critical applications such as autonomous driving, it is essential to understand and improve their robustness with respect to a large range of LiDAR corruptions. In this paper, we aim to comprehensively analyze the robustness of LiDAR semantic segmentation models under various corruptions. To rigorously evaluate the robustness and generalizability of current approaches, we propose a new benchmark called SemanticKITTI-C, which features 16 out-of-domain LiDAR corruptions in three groups, namely adverse weather, measurement noise, and cross-device discrepancy. Then, we systematically investigate 11 LiDAR semantic segmentation models, spanning different input representations (e.g., point clouds, voxels, and projected images), network architectures, and training schemes. Through this study, we obtain two insights: 1) The input representation plays a crucial role in robustness. Specifically, different representations perform differently under specific corruptions. 2) Although state-of-the-art methods for LiDAR semantic segmentation achieve promising results on clean data, they are less robust when dealing with noisy data. Finally, based on the above observations, we design a robust LiDAR segmentation model (RLSeg) which greatly boosts robustness with simple but effective modifications. We hope that our benchmark, comprehensive analysis, and observations can boost future research in robust LiDAR semantic segmentation for safety-critical applications.
In recent years, arbitrary image style transfer has attracted more and more attention. Given a pair of content and style images, the goal is a stylized image that retains the content of the former while capturing the style patterns of the latter. However, it is difficult to maintain a good trade-off between content details and style features. When an image is stylized with sufficient style patterns, the content details may be damaged, and sometimes the objects in the image cannot be distinguished clearly. For this reason, we present a new transformer-based method named STT for image style transfer, together with an edge loss that noticeably enhances content details and avoids the blurred results caused by excessive rendering of style features. Qualitative and quantitative experiments demonstrate that STT achieves performance comparable to state-of-the-art image style transfer methods while alleviating the content leak problem.
Optical coherence tomography (OCT) captures cross-sectional data and is used for the screening, monitoring, and treatment planning of retinal diseases. Technological developments to increase the speed of acquisition often result in systems with a narrower spectral bandwidth, and hence a lower axial resolution. Traditionally, image-processing-based techniques have been used to reconstruct subsampled OCT data, and, more recently, deep-learning-based methods have been explored. In this study, we simulate reduced axial scan (A-scan) resolution by Gaussian windowing in the spectral domain and investigate the use of a learning-based approach for image feature reconstruction. In anticipation of the reduced resolution that accompanies wide-field OCT systems, we build upon super-resolution techniques to reconstruct lost features using a pixel-to-pixel approach with an altered super-resolution generative adversarial network (SRGAN) architecture, with the aim of better supporting clinicians' decision-making and improving patient outcomes.
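The link between spectral bandwidth and axial resolution can be sketched numerically. The snippet below is an illustrative toy, not the study's pipeline: the signal length, reflector depth, window width `sigma`, and helper names are all assumed values, and a naive discrete Fourier transform is used so that only the standard library is needed. It multiplies a synthetic spectral fringe from a single reflector by a Gaussian window and compares the width of the resulting depth profile.

```python
import cmath
import math


def gaussian_window(n, sigma):
    """Gaussian centred on the spectrum; a smaller sigma means a narrower bandwidth."""
    center = (n - 1) / 2
    return [math.exp(-0.5 * ((k - center) / sigma) ** 2) for k in range(n)]


def a_scan(spectrum):
    """Depth profile as the magnitude of a naive DFT (positive depths only)."""
    n = len(spectrum)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * z / n) for k, s in enumerate(spectrum)))
        for z in range(n // 2)
    ]


def half_max_width(profile):
    """Number of depth bins at or above half of the peak value."""
    peak = max(profile)
    return sum(1 for v in profile if v >= 0.5 * peak)


N, DEPTH = 64, 10  # assumed toy sizes: 64 spectral samples, one reflector at bin 10
fringe = [math.cos(2 * math.pi * DEPTH * k / N) for k in range(N)]
narrowband = [s * w for s, w in zip(fringe, gaussian_window(N, sigma=5.0))]

full_res = a_scan(fringe)      # sharp peak at the reflector depth
low_res = a_scan(narrowband)   # same depth, but a broader peak
```

The peak stays at the same depth in both profiles, while the windowed spectrum yields a wider peak, which is the loss of axial resolution that a learned reconstruction would then try to undo.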
With the increasing ability of large language models (LLMs), in-context learning (ICL) has become a new paradigm for natural language processing (NLP), where LLMs make predictions based only on contexts augmented with a few training examples. Exploring ICL to evaluate and extrapolate the ability of LLMs has become a new trend. In this paper, we aim to survey and summarize the progress, challenges, and future work in ICL. We first present a formal definition of ICL and clarify its relation to related studies. Then, we organize and discuss advanced techniques of ICL, including training strategies and prompting strategies. Finally, we present the challenges of ICL and provide potential directions for further research. We hope our work can encourage more research on uncovering how ICL works and on improving ICL.
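As a small illustration of a context "augmented with a few training examples", the sketch below assembles a few-shot prompt as plain text. The `Input:`/`Output:` template and the function name are assumptions chosen for illustration; the survey does not prescribe a particular format, and no model call is made here.

```python
def build_icl_prompt(demonstrations, query):
    """Concatenate (input, output) demonstrations, then the unanswered query."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in demonstrations]
    blocks.append(f"Input: {query}\nOutput:")  # the model is expected to continue here
    return "\n\n".join(blocks)


demos = [("2 + 2", "4"), ("3 + 5", "8")]
prompt = build_icl_prompt(demos, "1 + 6")
```

The model's parameters are untouched: the demonstrations only change the text it conditions on, which is what separates ICL from fine-tuning.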
Gaze estimation is the fundamental basis for many visual tasks. Yet, the high cost of acquiring gaze datasets with 3D annotations hinders the optimization and application of gaze estimation models. In this work, we propose a novel Head-Eye redirection parametric model based on Neural Radiance Field, which allows dense gaze data generation with view consistency and accurate gaze direction. Moreover, our head-eye redirection parametric model can decouple the face and eyes for separate neural rendering, so that the attributes of the face, identity, illumination, and eye gaze direction can be controlled separately. Thus, diverse 3D-aware gaze datasets can be obtained by manipulating the latent codes belonging to different face attributes in an unsupervised manner. Extensive experiments on several benchmarks demonstrate the effectiveness of our method in domain generalization and domain adaptation for gaze estimation tasks.
Generalizability to unseen forgery types is crucial for face forgery detectors. Recent works have made significant progress in generalization through synthetic forgery data augmentation. In this work, we explore another path for improving generalization. Our goal is to reduce the features that are easy to learn in the training phase, so as to reduce the risk of overfitting to specific forgery types. Specifically, in our method, a teacher network takes face images as input and generates an attention map of the deep features using a diverse multihead attention ViT. The attention map is used to guide a student network to focus on the low-attended features by reducing the highly attended deep features. A deep feature mixup strategy is also proposed to synthesize forgeries in the feature domain. Experiments demonstrate that, without data augmentation, our method achieves promising performance on unseen forgeries and highly compressed data.